ABSTRACT

Recently, datasets that contain sentence descriptions of images
have enabled models that can automatically generate image
captions. However, collecting these datasets is still very
expensive. Here, we present SentenceRacer, an online game that
gathers and verifies descriptions of images at no cost. Similar
to the game hangman, players compete to uncover words in a
sentence that ultimately describes an image. SentenceRacer both
generates sentences and verifies that they are accurate
descriptions. We show that SentenceRacer generates annotations
of higher quality than those generated on Amazon Mechanical
Turk (AMT).


INTRODUCTION 

The ability to describe images with sentences has numerous
applications, such as helping the visually impaired browse the
Internet independently. Recently, with competitions like
Microsoft COCO’s Image Captioning [2], there has been increased
interest in the task of automatic image description generation
[4]. With this interest comes a pressing need for large-scale
datasets that can be used to train these sentence generation
models. Datasets like COCO [2] and Flickr30k [6] have been
collected by crowdsourcing the description task to human
workers on Amazon Mechanical Turk. Once a sentence is generated
by one crowd worker, both [2, 6] send it to additional crowd
workers to verify its accuracy. The biggest bottleneck in
growing these datasets to a much larger scale has been the cost
of generating these sentences and verifying their accuracy.

Humans are strikingly proficient at “filling in the blanks”,
whether it be crosswords, hangman, or Wheel of Fortune. We
enjoy partially filled puzzles because of the feeling of
simultaneously knowing and not knowing the full answer [3].
Previous research by von Ahn and Dabbish shows gamification to
be an effective vehicle for labeling images in datasets for
free [5]. However, such games were limited to single-word
annotations [5]. This paper explores another game mechanism in
order to generate more complex, full-sentence annotations. We
propose a gamification method to crowdsource sentence
annotations of images by having players write sentences for
other players to guess. Additionally, we show that there exists
a direct correlation between the accuracy of the sentence
description and a player’s ability to guess the sentence.


* These authors contributed equally to the publication. 


[Figure 1 screenshot: the image, a scoreboard (Sherman: 4,
ranjay: 2, kenji: 1), a chat log of guesses ('horses', 'grass',
'farm', 'field'), and the partially revealed sentence.]

Figure 1. A screenshot of SentenceRacer’s interface. The left side
displays the image and the state of the verified words so far. The
right side displays the chat interface and the scoreboard for guessing.

Motivated to reduce the cost of collecting a large image
captioning dataset, we present a game that achieves the
following:

1. Generates sentence descriptions of images 

2. Verifies that these sentences are accurate 

3. Captures sentences of better quality than those collected by 
Amazon Mechanical Turk (AMT). 

SYSTEM DESCRIPTION 

SentenceRacer is played with a minimum of three players. The
leader position rotates each round. The leader sets a sentence
describing the image for the other players to guess. After
eliminating stop words from the sentence, we allow all players
to see guesses made by other players while blocking the
leader’s communication with the guessers. Players have only a
limited time to guess the words set by the leader. A correct
guess rewards both the guesser and the leader with points and
reveals the guessed word’s position in the sentence. This
reward system implicitly motivates the leader to write
sentences that describe the image well, as they will be more
easily guessed by the other players.
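
The round mechanics above can be sketched in a few lines of Python. This is a minimal illustration, not the game's actual implementation: the `Round` class, the stop-word list, and the one-point-per-correct-guess scoring are all assumptions, and the time limit is left out.

```python
from dataclasses import dataclass, field

# Hypothetical stop-word list; the paper does not say which list the game uses.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to"}


@dataclass
class Round:
    """One round: the leader's sentence plus the state of the guessing."""
    leader: str
    sentence: str
    scores: dict = field(default_factory=dict)
    revealed: set = field(default_factory=set)  # blank positions uncovered so far

    def __post_init__(self):
        self.words = self.sentence.lower().split()
        # Assumption: stop words are shown from the start, so only the
        # remaining words are blanks that players must guess.
        self.blanks = {i for i, w in enumerate(self.words) if w not in STOP_WORDS}

    def guess(self, player, word):
        """Apply a guess; return True if it uncovers at least one blank."""
        if player == self.leader:
            return False  # the leader cannot communicate with the guessers
        hidden = self.blanks - self.revealed
        hits = {i for i in hidden if self.words[i] == word.lower()}
        if not hits:
            return False
        self.revealed |= hits
        # Assumed scoring: one point each for the guesser and the leader.
        for name in (player, self.leader):
            self.scores[name] = self.scores.get(name, 0) + 1
        return True

    def remaining_blanks(self):
        return len(self.blanks - self.revealed)

    def display(self):
        """Sentence as the guessers see it, with hidden words as underscores."""
        return " ".join(
            w if i not in self.blanks or i in self.revealed else "_" * len(w)
            for i, w in enumerate(self.words)
        )


game = Round(leader="sherman", sentence="two horses are in a field")
game.guess("kenji", "horses")
game.guess("ranjay", "grass")
print(game.display(), game.scores, game.remaining_blanks())
```

Because the leader earns points only when others guess correctly, a sentence made of words that are visibly grounded in the image is the leader's best strategy under this scoring.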

DATA AND ANALYSIS 

To gather the data, we took ten groups of four volunteers and 
ran each group on the same ten randomly sampled images 
from Microsoft’s COCO dataset [2]. 

People Find SentenceRacer to be More Fun

Qualitative results suggest that SentenceRacer is more fun in
comparison to the task of image captioning. Surveys comparing
SentenceRacer and a standard AMT image captioning task show
that players find SentenceRacer more fun and engaging,
particularly because of the social and fast-paced aspects of
the game.

SentenceRacer’s Sentences are Confirmed by AMT 

Sentences collected by SentenceRacer were sent to AMT for
verification by three crowd workers. A sentence is verified by
AMT if at least two of the three workers agree that the
sentence accurately describes the image. A sentence is
considered verified by SentenceRacer if all the words in the
sentence are guessed by the players. We found that 87.8% of
sentences verified by SentenceRacer were also verified by AMT,
while only 54.9% of the sentences not verified by SentenceRacer
were verified by AMT. We also found that the sentences
collected from SentenceRacer have a higher percentage of
verified sentences (87.8%) than those collected from AMT
workers (85.5%) on the same images.
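
The two verification rules and the agreement figures can be expressed as a short sketch. The function names and the record layout (remaining blanks paired with three worker votes) are hypothetical, chosen only to illustrate the definitions in the text.

```python
def amt_verified(votes):
    """AMT rule from the text: at least two of the three workers agree."""
    return sum(bool(v) for v in votes) >= 2


def game_verified(remaining_blanks):
    """SentenceRacer rule from the text: every word was guessed."""
    return remaining_blanks == 0


def amt_rate(records, game_ok):
    """Fraction of sentences AMT verified, among those whose game
    verification status equals game_ok. Returns None if none match.

    records: iterable of (remaining_blanks, votes) pairs (assumed layout).
    """
    subset = [votes for blanks, votes in records if game_verified(blanks) == game_ok]
    if not subset:
        return None
    return sum(amt_verified(v) for v in subset) / len(subset)
```

Calling `amt_rate(records, True)` and `amt_rate(records, False)` on the collected sentences would give the two percentages compared in the paragraph above.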

We investigated the relation between the percentage of
sentences verified by AMT and the number of blanks remaining
at the end of the game. Table 1 shows that the percentage
verified generally increases as the number of blanks decreases.
The tail end of the blank counts is sparse, causing the data to
have high variance. However, we believe that this trend still
shows that SentenceRacer’s verification process adequately
determines whether a sentence accurately describes an image:
the fewer blanks that remain, the more likely the sentence is
to be verified.


Source          Total Blanks   # Sentences   Verified (%)

SentenceRacer   4              7             42.80
SentenceRacer   3              12            50.00
SentenceRacer   2              9             33.30
SentenceRacer   1              12            75.00
SentenceRacer   0              49            87.80
AMT             -              200           85.50


Table 1. Correlation between number of blanks and percentage of
verified sentences. As the number of blanks decreases, the percentage
of sentences verified by AMT increases. Also in comparison, sentences
collected from AMT have a lower verification percentage than the
sentences collected by SentenceRacer.
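
As a rough check of the trend in Table 1, the sketch below computes an unweighted Pearson coefficient over the five SentenceRacer rows. The choice of Pearson's coefficient is an assumption made for illustration; the paper does not state which correlation measure, if any, was computed.

```python
from math import sqrt

# (total blanks, number of sentences, % verified by AMT), copied from Table 1.
ROWS = [(4, 7, 42.8), (3, 12, 50.0), (2, 9, 33.3), (1, 12, 75.0), (0, 49, 87.8)]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


blanks = [row[0] for row in ROWS]
verified_pct = [row[2] for row in ROWS]
r = pearson(blanks, verified_pct)
print(f"Pearson r between blanks and % verified: {r:.2f}")
```

The coefficient comes out negative (close to -0.8), meaning that more remaining blanks go with a lower verification rate. With only five bins, some holding fewer than ten sentences, this should be read as a rough indication rather than a precise estimate.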

SentenceRacer has Higher Sentence Quality than AMT 

Figure 2 shows some of the sentences we received from the AMT
task and from playing SentenceRacer. We measure sentence
quality by the amount of information we can extract from a
sentence describing an image. The average number of objects,
object attributes, and pairwise object relationships per
sentence is a basic indicator of sentence quality [1]. Table 2
shows that SentenceRacer sentences contain statistically
significantly more objects and relationships, and suggests that
SentenceRacer may provide more attributes as well.

AMT

The kitchen is very sophisticated and modern.
The double sink is freshly polished chrome.
Two stools are next to the bar.

SR

A clean white table in the middle of a large kitchen.
A silver sink is on a white granite countertop.
Two white chairs are under a white tabletop.

Figure 2. Comparing verified sentences from AMT and SentenceRacer (SR).


We believe SentenceRacer’s sentence quality stems from the
rule that correct guesses reward both the guesser and the
leader. Players are incentivized to write longer sentences,
leading to higher averages of objects, relationships, and
attributes than in AMT tasks, where this incentive is absent.



                       Objects   Relationships   Attributes

AMT (n=200)            2.30      1.02            1.17
SentenceRacer (n=49)   2.98      1.88            1.45
P-values               <0.01     <0.001          0.1


Table 2. T-test results showing that SentenceRacer sentences contain
an increased number of objects, relationships, and (potentially)
attributes per sentence.


CONCLUSION 

In this paper, we demonstrate how SentenceRacer is able to 
collect sentences describing images, an expensive task for 
Computer Vision research. This system introduces the idea of 
collecting and verifying sentences through a game that uses 
contextual cues as a means of entertainment and verification. 
Our evaluations suggest that this game is more enjoyable than 
standard methods of collecting sentences. SentenceRacer can 
also simultaneously perform the collection and verification 
of sentences. Finally, we show that the sentences collected 
by SentenceRacer are of higher quality than those collected 
by AMT. 

We hope to explore how a list of taboo words may increase the
diversity of the sentences collected. We also hope to
investigate creative ways of using the waiting period between
rounds to crowdsource other tasks, such as grounding objects in
the sentence within the image itself.